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ABSTRACT 

Research in the genomic sciences is confronted 
with the volume of sequencing and resequencing 
data increasing at a higher pace than that of data 
storage and communication resources, shifting a 
significant part of research budgets from the 
sequencing component of a project to the compu- 
tational one. Hence, being able to efficiently store 
sequencing and resequencing data is a problem of 
paramount importance. In this article, we describe 
GReEn (Genome Resequencing Encoding), a tool 
for compressing genome resequencing data using 
a reference genome sequence. It overcomes some 
drawbacks of the recently proposed tool GRS, 
namely, the possibility of compressing sequences 
that cannot be handled by GRS, faster running 
times and compression gains of over 100-fold for 
some sequences. This tool is freely available 
for non-commercial use at ftp://ftp.ieeta.pt/~ap/ 
codecs/GReEnl .tar.gz. 

INTRODUCTION 

Inspired by the Biocompress algorithm of Grumbach 
and Tahi (1), the last two decades have witnessed the 
proposal of a myriad of algorithms for compressing 
genomic sequences [(2-16) for a recent review]. 
The acquired knowledge regarding genome structure 
that these compression algorithms have been providing, 
through their representation of genomic sequences 
using probabilistic models, is likely to surpass in relevance 
the benefits of the effective storage space reduction 
provided. 

One of the most successful compression algorithms spe- 
cifically designed for genomic sequences is XM, a statis- 
tical method proposed by Cao et al. (14), though other 
approaches may present competitive or even superior 
results for some classes of genomes (15,17). XM relies 
on a mixture of experts for providing symbol by symbol 
probability estimates that are fed to an arithmetic encoder. 



The XM algorithm comprises three types of experts: 
(i) order-2 Markov models; (ii) order- 1 context Markov 
models, i.e. Markov models that rely on statistical infor- 
mation from a recent past (typically, the 512 previous 
symbols); (iii) the copy expert, which considers the next 
symbol as part of a copied region from a particular offset. 
The probability estimates provided by the set of experts 
are then combined using Bayesian averaging and sent to 
the arithmetic encoder. 

Common practice continues to rely on standard and 
general purpose data compression methods, e.g. gzip or 
bzip2. However, this practice may be close to a turning 
point, as the rate at which genomic data is being produced 
is clearly overtaking the rate of increase in storage 
resources and communication bandwidth. 

The development of high-throughput sequencing 
technologies that offer dramatically reduced sequencing 
costs enables possibilities hardly foreseeable a decade 
ago (18). Large-scale projects such as the 1000 Genomes 
Project (http://www.1000genomes.org/) and The Cancer 
Genome Atlas (http://cancergenome.nih.gov/), as well as, 
prizes that reward cheaper, faster, less prone to errors and 
higher-throughput sequencing methodologies (e.g. http:// 
genomics.xprize.org/) are paving the way to individual 
genomics and personalized medicine (19). As such, huge 
volumes of genomic data will be produced in the near 
future. However, as a very significant part of the 
genome is shared among individuals of the same species, 
these data will be mostly redundant. Some ideas for 
storing and communicating redundant genomic data 
have already been put forward, based on, for example, 
single nucleotide polymorphism (SNP) databases (20), or 
insert and delete operations (21). 

Recently, Wang et al. (22) proposed a compression tool, 
GRS, that is able to compress a sequence using another 
one as reference without requiring any additional infor- 
mation about those sequences, such as, a reference SNPs 
map. The algorithm proposed by Kuruppu et al. (23) 
RLZ, is also able to perform relative Lempel-Ziv com- 
pression of DNA sequences, though its current implemen- 
tation cannot fully handle sequences that have characters 
outside the {a,c,g,t,n} set. 
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Other approaches propose encoding sequence reads 
output by massively parallel sequencing experiments 
(24-27), which is also a very important problem. The com- 
pression of short reads shares some points with the 
problem being addressed here, though it needs to cope 
with other requirements, such as, the efficient representa- 
tion of base calling quality information. 

In this article, we describe GReEn (Genome Resequenc- 
ing Encoding), a new tool for compressing genome 
resequencing data using a reference genome sequence. 
As such, it addresses the same problem as GRS (22), 
RLZ (23) or XM (14). However, as will be demonstrated, 
GReEn outperforms GRS in storage space requirements 
and running times, though GRS can handle some se- 
quences in a very effective way, and it overcomes RLZ's 
and XM's lack of support for arbitrary alphabets and 
inferior performance. 

GReEn is a compression tool based on arithmetic 
coding that handles arbitrary alphabets. Its running time 
depends only on the size of the sequence being com- 
pressed. Moreover, it provides compression gains of over 
100-fold for some sequences, when compared to GRS, and 
even larger gains when compared to RLZ. Finally, GReEn 
handles without restriction sequences that cannot be 
compressed with GRS due to excessive difference to the 
reference sequence. 

MATERIALS AND METHODS 

Dataset 

We use the same data as in (22), for ease of comparison 
with GRS: two versions of the first individual Korean 
genome sequenced, KOREF_20090131 and KOREF_ 
20090224 (28); two versions of the genome of the thale 
cress Arabidopsis thaliana, TAIR8 and TAIR9 (29,30); 
and two versions of the rice Oryza sativa genome, 
TIGR5.0 and TIGR6.0 (31). We also present results for 
four additional human genome assemblies, namely, the 
genome of J. Craig Venter referred to as HuRef (32), the 
Celera alternate assembly referred to as Celera (33), 
the genome of a Han Chinese individual referred to as 
YH (34), and the human genome reference assembly 
build 37. p2, as made available by the National Center 
for Biotechnology Information and referred to as 
NCBI37 (35). 

Software availability 

The codec (encoder/decoder) is implemented in the 
C programming language and is freely available for 
non-commercial purposes. It can be downloaded at 
ftp://ftp.ieeta.pt/~ap/codecs/GReEnl.tar.gz. 

The compression method 

As with GRS (22), GReEn relies on a reference sequence 
for compressing the target sequence. The reference 
sequence is generally only slightly different from the 
target sequence, although this is not mandatory. In fact, 
it is possible to use a sequence from a different species as 
reference, though, as expected, the compression efficiency 



depends on the degree of similarity between both reference 
and target sequences. Moreover, in order to recover 
the target sequence, the decoder needs access to exactly 
the same reference sequence as that used by the encoder. 

The codec developed in GReEn is able to handle arbi- 
trary alphabets, although it automatically ignores all lines 
beginning with the '>' character, as well as, all newline 
characters. We denote by C = {c\ , c 2 , , C\c\} the set of all dif- 
ferent characters, or symbols, that are found in the target 
sequence, where \C\ denotes the number of elements in C, 
i.e. the alphabet size. 

Each character of the target sequence is encoded by an 
arithmetic encoder (36). As with any arithmetic encoder, 
besides the symbol to encode, it is necessary to provide the 
probability distribution of the symbols. One major advan- 
tage of arithmetic coding is its ability to adjust the prob- 
abilistic model as the encoding proceeds, in response to 
the changing probability distribution from one encoded 
symbol to the next. 

We denote by 6(c) the relative frequency of character 
c e C in the target sequence, and by P„(c) the estimated 
probability of character c eC when encoding the character 
at position n in the target sequence. The set of 
probabilities {P n (c), c e C} are passed down to the arith- 
metic coder. Note that, whereas 0(c) values are fixed for a 
given target sequence, P„(c) values usually change along 
the coding process. For a sequence x N = x\x 2 x N , x t e C, 
with N characters, the arithmetic coder produces a bit- 
stream with 

N 

-J^log 2 P n (x n ) (1) 

«=1 

bits, which demonstrates the importance of providing 
good probability estimates to the arithmetic coder. 

The probability distribution, P„(c), can be provided by 
two different sources: (i) an adaptive model (the copy 
model) which assumes that the characters of the target 
sequence are an exact copy of (parts of) the reference 
sequence; (ii) a static model that relies on the frequencies 
of the characters in the target sequence, i.e. 6(c). The 
adaptive model is the main statistical model, as it allows 
a high compression rate of the target sequence, particu- 
larly in areas where the target and reference sequences are 
highly similar. However, this adaptive, or copy, model will 
at times not be used (the reasons why will be detailed 
shortly), and the static model will act as a fallback mech- 
anism, feeding the arithmetic coder with the required 
probability distribution. 

The copy model 

The copy model is inspired by the copy expert of the XM 
DNA compression method (14), relying on a pointer to a 
position in the reference sequence that has a 'good chance' 
of containing a character identical to that being encoded. 
As encoding of the target sequence proceeds, the pointer 
associated with the copy model may be repositioned to 
different locations of the reference sequence. When this 
repositioning occurs, all parameters of the model are 
reset. Besides accounting for the number of times, t n , 
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Target 



341 587 Current position 

Reference j j 

••• GGATAGGTAACGATATTCCTG ••• 
i 

327 829 

\ L 



■■• GGATAGGTAacgGTATTcct? 

Figure 1. The copy model. In this example, the copy model was re- 
started at position 341 587 of the reference sequence, corresponding 
to position 327 829 of the target sequence. Since then, it has cor- 
rectly predicted 5 characters, if the case is considered, and a total of 
1 1 characters if the case is ignored. The dashed arrow indicates a failed 
prediction. According to this example, the next character to be pre- 
dicted is 'G\ 



that the copy model was used after the previous repos- 
itioning, two more counters are maintained: h\ stores the 
number of times the model guessed the correct character 
including the correct case (uppercase or lowercase), and 
h\ records the number of times the model guessed the 
character but failed the case (for example, it guessed 'A' 
but the correct character was 'a'). 

Figure 1 exemplifies the operation of the copy model. 
Consider the most recent repositioning occurred at 
position 341 587 of the reference sequence, corresponding 
to position 327 829 of the target sequence (in this example, 
the reference is ahead of the target, but this may be dif- 
ferent in other cases). Assuming the codec is going to 
compress the character marked with '?', then the character 
predicted by the copy model would be 'G' (the one under 
the 'Current position' arrow), with t n = 12, h\ — 5 and 
h\ = 6. The characters linked by the dashed arrow 
indicate a prediction error (the predicted character was 
'A', whereas the correct one was 'G'). 

Computing the probabilities. Let us denote by p\ the char- 
acter predicted by the copy model ('G' in the example in 
Figure 1) and by p 2 n the case converted p\ ('g' according to 
the example in Figure 1). If p\,p\ e C (note that characters 
of the reference sequence that do not appear in the target 
sequence do not belong to C), the probabilities that are 
passed down to the arithmetic coder are given by 



Pn(c) 



hi 



1 



t„ + 3 ' 
hl + l 



for c = p n 



for 



Pn 



?„ + 3 

1 ~ P,M) - Pn(p 2 n ) 

l-6(p n )-9(pl) 



(2) 



0(c), tor c^p n ,p 2 n . 



The first two branches of Equation (2) correspond to 
Laplace probability estimators of the form 



P(£ k ) 



1 



£/< c C, 



(3) 



K 



where the £fcS form a set of K collectively exhaustive and 
mutually exclusive events, and Ng t denotes the number of 
times that event Ek has occurred in the past. In Equation 
(2) we considered three events, namely, £\ — {p\}, 
£ 2 = {p 2 n ] and E 3 = C\{p l n ,]%}. The third branch of 
Equation (2) defines how the probability assigned to 
£•}, i.e. 1 — P(£ i) — P(£2), is distributed among the indi- 
vidual characters of £3. This distribution is proportional 
to the relative frequencies of the characters, 0(c), after 
discounting the effect of treating p\ and p\ differently. 

If only p\ or p 2 n belongs to C, the probabilities are 
given by 



P„(c) 



1 



- 1 

Pn(p) 



for 



P 



(4) 



l-0(p) 



0(c), for c^p 



where h = h\ if p = p\, or h = h\ if p = p\. As such, 
we have considered only two events, namely, £ \ = {p} 
and £2 =C\{p}, where the distribution of probabilities 
among the characters of £% is performed as before. 

Finally, if both p) v p\ the probabilities communi- 
cated to the arithmetic coder are the character frequencies 
of the target sequence, i.e. 



P„(c) = 0(c). 



(5) 



A = l 



Starting and stopping the copy model. Typically, the codec 
starts by constructing a hash table with the occurrences 
and corresponding positions in the reference sequence of 
all A>mers of a given size (the default size is k = 11, but it 
can be changed using a command line option). Figure 2 
shows an example where k = 8 and k-mers 'CTNANGTC 
and 'AAAGTTGG' have been mapped by the hashing 
function into the same index (index 4 529 821). As usual 
in hashing schemes, disambiguation is achieved by direct 
comparison of the £-mers that originated the index, which 
have to be stored in the data structure in order to be 
compared. Using the hash table, it is easy to find in the 
reference sequence the characters that come right after all 
occurrences of a given k-mer. 

Before encoding a new character from the target 
sequence, the performance of the copy model, if in use, 
is checked. If t„ — h\ — h\ > mj, where m f is a parameter 
that indicates the maximum number of prediction failures 
allowed, the copy model is stopped. The default value for 
rrif is zero, but this may be changed through a command 
line option. 

Following this performance check, if the copy model is 
not in use, an attempt is made to restart the copy model 
before compressing the character. This is accomplished by 
looking for the positions in the reference sequence where 
the A;-mer composed of the £-most-recently-encoded char- 
acters occurs. If more than one position is found, the one 
closest to the encoding position is chosen. If none is found, 
the current character is encoded using the static model and 
a new attempt for starting a new copy model is performed 
after advancing one position in the target sequence. 
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231 568 

\ 



341 587 
_j_ 



681 789 



900 812 
_i_ 



CTNANGTC 



GGATAGGT 



CTNANGTC 



AAAGTTGG 



4 529 821 



CTNANGTC 


AAAGTTGG 


231 568 


900 812 


681 789 





8721 311 



GGATAGGT 



341 587 



Figure 2. Data organized in a hash table. 



Special case for equal size sequences. When the reference 
and target sequences have the same size, the codec 
assumes that both sequences are aligned. Therefore, 
whenever the copy model is restarted, it is forced to use 
the current encoding position as reference. This avoids 
constructing the hash table, hence, increasing the codec 
speed and generally producing better results. However, 
this mode of operation may be overridden by a 
command line option, as it may lead to poor performance 
for same-sized sequences that are not aligned. 



Table 1. Arabidopsis thaliana genome: compression of TAIR9 using 
TAIR8 as reference 



Chr 


Id 


Size 


GRS 




GReEn 


Bytes 


Sees 


Bytes Sees 


1 


1 1 


30427 671 


715 


7 


1551 13 


2 


1 1 


19 698 289 


385 


4 


937 8 


3 


10 


23 459 830 


2989 


6 


1097 9 


4 


7 


18 585 056 


1951 


5 


2356 7 


5 


5 


26975 502 


604 


6 


618 11 


Total 




119146 348 


6644 


28 


6559 48 



RESULTS 

We compare the performance of the method proposed 
here, GReEn, to the performance of GRS (22), the most 
recently proposed approach for compressing genome 
resequencing data that handles sequences drawn from 
arbitrary alphabets. We also include results of the RLZ 
algorithm (23) for some sequences, due to its restriction to 
sequences drawn from the alphabet {a, c, g, t, n}. 

Tables 1-3 present both the number of bytes produced 
and the time taken by the respective methods for com- 
pressing the sequences used in (22). The results regarding 
both the GRS and RLZ methds have been obtained using 
the software publicly provided by the authors. All experi- 
mental results were obtained using an Intel Core i7-2620M 
laptop computer at 2.7 GHz with 8 GB of memory and 
running Ubuntu 11.04. The best results are highlighted 
in boldface. 

Table 1 displays the compression results for the TAIR9 
version of the thale cress genome using the TAIR8 version 
as reference. Globally, GReEn required 6559 bytes for 
storing the sequences, whereas GRS needed a little more 
(6644 bytes). While GReEn took 48 s to encode the data, 
GRS needed only 28. Therefore, in this case, GRS is 
equivalent to the proposed method in terms of storage 
space, but faster. 

Table 2 displays the compression results for the 
TIGR6.0 version of the rice genome using the TIGR5.0 
version as reference. In this case, the outcome varies dra- 
matically to the previous results (Table 1). The first 



Size of the compressed target sequences (in bytes) and corresponding 
compression time (in seconds). The original sequence alphabets have 
been preserved. The |C| column indicates the size of the alphabet of the 
target sequence. 



significant difference can be observed in both the com- 
pressed size and compression time of chromosome 
1 : 1 502040 bytes in 708 s using GRS versus 4972 bytes 
in 18 s (more than a 300-fold improvement) using 
GReEn. A similarly significant difference can be 
observed in chromosome 11 (with a gain of over 
160-fold). Globally, GRS required 4901902 bytes and 
2290 s, whereas GReEn was able to store the entire 
genome in just 125 535 bytes (39-fold improvement) 
using only 123 s of computing time. 

The main conclusion from these results is that, under 
certain conditions not yet investigated, GRS fails to find 
large-scale similarities between the two sequences. 
Therefore, the number of bytes generated is much larger 
than necessary and, probably as a consequence, the 
running time explodes. Moreover, when the target 
sequence is exactly equal to the reference sequence (as in 
chromosomes 6, 9 and 12), the GRS reports a number of 
bytes that is essentially zero [in (22) they are shown as 
zero, although we opted to display the number of bytes 
effectively used], while GReEn uses a few hundred bytes. 
However, if critical, this could be easily reduced to almost 
zero using a sequence comparison before starting encoding 
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Table 2. Oryza sativa genome: compression of TIGR6.0 using 
TIGR5.0 as reference 



p Vil- 
lain 




jize 


RLZ 






GReEn 


Bytes 


Sees 


Bytes 


Sees 


Bytes 


Sees 


1 

1 


J 


4 J ZOO 5 /V 


185 715 


35 


1 A/in 


/Uo 


4 972 


18 


2 


5 


35 930 381 


210295 


28 


1 409 


5 


1906 


14 


3 


6 


36406 689 






47 764 


28 


17 890 


15 


4 


5 


35278225 


175 663 


27 


36 145 


20 


6 750 


14 


5 


5 


29 894 789 


120 625 


21 


6 177 


5 


5 539 


12 


6 


5 


31246789 


61038 


23 


14 


4 


482 


2 


7 


5 


29 696 629 


167822 


21 


4067 


8 


2448 


12 


8 


5 


28 439 308 


109 608 


20 


118 246 


43 


9 507 


11 


9 


5 


23 011239 


44 953 


16 


14 


4 


366 


2 


10 


9 


23 134 759 






788 542 


339 


60449 


9 


11 


11 


28 512 666 






2 397 470 


1 122 


14 797 


12 


12 


5 


27497214 


53714 


19 


14 


4 


429 


2 


Total 




372 317 567 






4 901902 


2 290 


125 535 


123 



Size of the compressed target sequences (in bytes) and corresponding 
compression time (in seconds). The original sequence alphabets have 
been preserved. The |C| column indicates the size of the alphabet of the 
target sequence. The missing RLZ values correspond to sequences with 
characters that cannot be handled by the current implementation of 
this algorithm. 

(note that, due to the requirement that the probabilities 
communicated to the arithmetic coder should be repre- 
sented as integers, a lower bound exists in the minimum 
number of bits that can be generated in each coding step). 

Table 3 displays the compression results for the 
KOPvEF_20090224 version of the human genome using 
the KOREF_20090131 version as reference. In this case, 
GReEn gives consistently better results, both in terms of 
storage requirements and computing time. In fact, this 
latter aspect deserves a special note because contrarily to 
GRS, the running time of GReEn varies linearly with the 
size of the sequence. Therefore, GReEn allows for an a 
priori good estimate of the time that is required to 
compress a given sequence. 

Besides considering the datasets in (22), we also inves- 
tigate four human genome assemblies, in order to provide 
a more comprehensive comparison of both GRS and 
GReEn compression approaches. However, our intention 
fell short because GRS failed to compress most of the 
sequences due to an excessive difference between the ref- 
erence and target sequences. Table 4 displays the results 
obtained when the YH genome was compressed using 
KOREF_20090224 as reference. It is clear that GRS 
gave unacceptable results, both regarding the size of the 
compressed sequences and the time required to compress 
them, for the few chromosomes that could be compressed 
with GRS. 

Table 5 displays the compression results, using GReEn, 
for four different human genome assemblies (HuRef, 
Celera, YH and KOREF_20090224) using the NCBI37 
version as reference. As this article is about sequence com- 
pression, not sequence analysis, we refrain from elaborat- 
ing too much on the differences observed. Nevertheless, 
we hint at what we believe may be possible explanations. 
First, the HuRef and Celera assemblies are not 
resequencing assemblies and this, per se, accounts for 



Table 3. Homo sapiens genome: compression of KOREF_20090224 
using KOREF_20090131 as reference 
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Size of the compressed target sequences (in bytes) and corresponding 
compression time (in seconds). The original sequence alphabets have 
been preserved. The size of the alphabet in the target sequence is 21 for 
all chromosomes, except for the M chromosome where it is 11. 



greater compression differences with respect to the refer- 
ence assembly. 

The HuRef assembly is an individual genome sequenced 
with capillary-based whole-genome shotgun technologies 
and de novo assembled with the Celera Assembler. Hence, 
this assembly is the farthest apart (i.e. with a larger 
number of bytes required for its compression) from the 
reference NCBI37 assembly. 

The Celera assembly represents one of the two pioneer- 
ing efforts in sequencing a human genome. Its consensus 
sequence is derived from the genomes of five individuals 
using a capillary-based whole-genome shotgun sequencing 
approach. Unlike the reference assembly generated by the 
International Human Genome Sequencing Consortium 
(here represented in the NCBI37 assembly), which used 
a clone-based hierarchical shotgun strategy that is more 
likely to output a high-quality finished genome sequence 
as the sequence assembly is local and anchored to the 
genome, the Celera Genomics Sequencing Team opted 
for a whole-genome shotgun strategy where sequence 
contigs and scaffolds must be individually anchored to 
the genome, rendering assembly more complex and more 
prone to long-range misassembly. Moreover, this whole- 
genome shotgun assembly resulted from a combined 
analysis of the genomic data generated by the Celera 
Genomics Sequencing Team and some data generated 
by the International Human Genome Sequencing 
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Table 4. Homo sapiens genome: compression of YH using 
KOREF 20090224 as reference 
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Size of the compressed target sequences (in bytes) and corresponding 
compression time (in seconds). The original sequence alphabets have 
been preserved. The missing values are due to the inability of GRS to 
compress sequences differing more than a predefined value. 



Table 5. Homo sapiens genome: compression with GReEn of the 
HuRef, Celera, YH and KOREF_20090224 versions using the 
NCBI37 as reference 
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Number of bytes after compressing each sequence. For ease of com- 
parison we transformed all characters to lowercase and mapped all 
unknown nucleotides to 'n' before compression. Therefore, after this 
transformation, all sequences were composed only of characters from 
the alphabet {a,c,g,t,n}. 



Consortium, hence, it has been claimed that this Celera 
assembly is not a totally independent human genome 
assembly (37). We believe this may be part of the explan- 
ation for the smaller compression values in Table 5 with 
respect to this assembly, than those of the HuRef 
assembly. 

The YH assembly is an individual genome based on 
resequencing data from massively parallel sequencing 
technologies and assembled with the Short Oligonucleo- 
tide Alignment Program, using the NCBI human genome 
assembly as reference. Essentially, it is a map of SNPs with 
respect to the reference assembly, hence it displays very 
low compression values in Table 5. 

The KOREF 20090224 assembly is also an individual 
genome based on resequencing data from massively 
parallel sequencing technologies and assembled with the 
Mapping and Assembly with Qualities program, using the 
NCBI human genome assembly as reference. As with the 
YH assembly, resequencing renders the resulting assembly 
very redundant with respect to the reference (NCBI37) 
assembly, hence also displaying very low compression 
values in Table 5. 

The compression values for chromosome 19 in the YH 
and KOREF_20090224 assemblies are unexpectedly high. 
This chromosome has the highest GC content (48.4%) 
and the lowest (median) sequence depth (28-fold) in the 
YH genome (34), hence constraining the quality of the 



final sequence. Not surprisingly, chromosome 19 in the 
YH genome has a very large number (more than twice 
those of the reference NCBI37 assembly) of unsequenced 
bases ('N' symbols in our encoding). Chromosome 19 in 
the KOREF_20090224 assembly faces the same hurdles, 
which we assume to be a consequence of the similar 
sequencing methodology. 

Finally, Table 6 displays again the compression results 
for the KOREF_20090224 version of the human genome 
using the KOREF_20090131 version as reference. 
However, for allowing the comparison of GReEn to 
GRS and RLZ on a larger genome, we converted the se- 
quences to the {a,c,g,t,n} alphabet. 



DISCUSSION 

The GRS tool recently introduced by Wang et al. (22) for 
compressing DNA resequencing data using a reference 
sequence allows to significantly reduce data storage 
space requirements. However, this tool seems to be effect- 
ive only when the target sequence is very similar to the 
reference sequence, preventing the compression of many 
sequences of interest. Moreover, as we have shown, for 
example, in chromosomes 1 and 11 of the TIGR6.0 
version of the rice genome, it may fail to give reasonable 
results even for similar sequences. Another drawback of 
GRS is that the encoding time does not depend only on 
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Table 6. Homo sapiens genome: compression of KOREF_20090224 
using KOREF_20090131 as reference 
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Number of bytes after compressing each sequence. For allowing the 
comparison to RLZ and GRS, all characters were transformed to 
lowercase before compression and all unknown nucleotides were 
mapped to 'n'. Therefore, after this transformation, all sequences 
were composed only of characters from set {a,c,g,t,n}. 



the sequence size, but mainly on the similarity between the 
target and reference sequences (lower similarity implying 
greater compression times), resulting in a large unpredict- 
ability regarding the time that a certain sequence requires 
to be compressed. 

To overcome these limitations, we propose a statistical 
compression method that uses a probabilistic copy model. 
The probabilities are estimated for every character of the 
target sequence and are used to feed an arithmetic coder. 
The compression tool has two control parameters, 
namely, the size of the £-mer that is used for searching 
copies (with a default value of k = 11), and the number of 
prediction failures that are tolerated by the copy model 
before it is restarted (with a default value of 0). Changing 
these parameters may change the performance of the 
codec, degrading the performance for some sequences 
while improving it for others. It is left to the user the 
decision of trying to optimize these parameters or, as we 
have done when producing the experimental results 
included in this article, to use the default values. 



CONCLUSION 

In this article, we described a computational tool, GReEn, 
aiming at compressing genome resequencing data using 
another sequence as reference. This tool is able to 
handle arbitrary alphabets and does not pose any 



restrictions or requirements on the sequences to 
compress. Several examples of its efficiency in compressing 
genomic data and its improvements with respect to other 
recently proposed tools have been included, rendering 
evident the practical interest of the tool here proposed. 

With the generation of increasingly larger volumes of 
genome sequencing and resequencing data, and the 
increasing costs associated to storing and transmitting 
those data, compression tools that efficiently recognize 
redundancies are in demand. However, the interest in 
such compression methodologies goes beyond data 
storage and communication. By being a probabilistic 
model of the underlying genomic sequence(s), compres- 
sion tools reveal similarities and differences that are para- 
mount for studies of human genomic variation between 
individuals, hence, key for progress in personal medicine 
efforts. 



FUNDING 

European Fund for Regional Development (FEDER) 
through the Operational Program Competitiveness 
Factors (COMPETE); Portuguese Foundation for 
Science and Technology (FCT) through project grants 
FCOMP-01-0124-FEDER-0 10099 (FCT reference 
PTDC/EIA-EIA/103099/2008) and FCOMP-01-0124- 
FEDER-022682 (FCT reference PEst-C/EEI/UI0127/ 
2011); European Social Fund (to S.P.G.); Portuguese 
Ministry of Education and Science (to S.P.G.). Funding 
for open access charge: Portuguese Foundation for 
Science and Technology (FCT) project grant FCOMP- 
01-0124-FEDER-022682 (FCT reference PEst-C/EEI/ 
UI0127/2011). 

Conflict of interest statement. None declared. 



REFERENCES 

1. Grumbach,S. and Tahi,F. (1993) Compression of DNA sequences. 
In Proceedings of the Data Compression Conference, DCC-93. 
IEEE, Snowbird. Utah, pp. 340-350. 

2. Grumbach,S. and Tahi,F. (1994) A new challenge for 
compression algorithms: genetic sequences. Inform. Process. 
Manag., 30, 875-886. 

3. Rivals,E., Delahaye,J.-P., Dauchet,M. and Delgrange,0. (1996) 
A guaranteed compression scheme for repetitive DNA sequences. 
In Proceedings of the Data Compression Conference, DCC-96. 
IEEE, Snowbird. Utah, p. 453. 

4. Loewenstern,D. and Yianilos,P.N. (1997) Significantly lower 
entropy estimates for natural DNA sequences. In Proceedings of 
the Data Compression Conf, DCC-97. IEEE, Snowbird. Utah, 
pp. 151-160. 

5. Chen,X., Kwong,S. and Li,M. (1999) A compression algorithm 
for DNA sequences and its applications in genome comparison. 
In Asai,K., Miyano.S. and Takagi,T. (eds), Genome Informatics 
1999: Proc. of the 10th Workshop. Universal Academy Press, Inc, 
Tokyo, Japan, pp. 51-61. 

6. Matsumoto,T., Sadakane,K. and Imai,H. (2000) Biological 
sequence compression algorithms. In Dunker.A.K., Konagaya,A., 
Miyano.S. and Takagi,T. (eds), Genome Informatics 2000: 
Proceedings of the 11th Workshop. Tokyo, Japan, pp. 43-52. 

7. Chen,X., Kwong,S. and Li,M. (2001) A compression algorithm 
for DNA sequences. IEEE Eng. Med. Biol. Mag., 20, 61-66. 



e27 Nucleic Acids Research, 2012, Vol. 40, No. 4 



Page 8 of 8 



8. Chen,X., Li,M., Ma,B. and Tromp.J. (2002) DNACompress: 
fast and effective DNA sequence compression. Bioinformatics, 18, 
1696-1698. 

9. Tabus,I., Korodi.G. and Rissanen,J. (2003) DNA sequence 
compression using the normalized maximum likelihood model for 
discrete regression. In Proceedings of the Data Compression 
Conference, DCC-2003. Snowbird. Utah, pp. 253-262. 

10. Manzini,G. and Rastero,M. (2004) A simple and fast DNA 
compressor. Softw. Pract. Exp, 34, 1397-1411. 

11. Korodi.G. and Tabus,I. (2005) An efficient normalized maximum 
likelihood algorithm for DNA sequence compression. ACM T. 
Inform. Syst., 23, 3-34. 

12. Behzadi,B. and Le Fessant,F. (2005) DNA compression challenge 
revisited. In Combinatorial Pattern Matching: Proceedings of 
CPM-2005, Vol. 3537 of LNCS. Springer, Jeju Island, Korea, 

pp. 190-200. 

13. Korodi,G. and Tabus,I. (2007) Normalized maximum likelihood 
model of order- 1 for the compression of DNA sequences. In 
Proceedings of the Data Compression Conference, DCC-2007. 
IEEE, Snowbird. Utah, pp. 33—42. 

14. Cao,M.D., Dix,T.I., Allison.L. and Mears.C. (2007) A simple 
statistical algorithm for biological sequence compression. 
Proceedings of the Data Compression Conference, DCC-2007 . 
IEEE, Snowbird. Utah, pp. 43-52. 

15. Pinho,A.J., Ferreira.P.J.S.G, Neves,A.J.R. and Bastos.C.A.C. 
(2011) On the representability of complete genomes by 
multiple competing finite-context (Markov) models. PLoS ONE, 
6, e21588. 

16. Giancarlo.R., Scaturro.D. and Utro.F. (2009) Textual data 
compression in computational biology: a synopsis. Bioinformatics, 
25, 1575-1586. 

17. Pinho,A.J., Pratas.D. and Ferreira.P.J.S.G. (2011) Bacteria DNA 
sequence compression using a mixture of finite-context models. In 
Proceedings of the IEEE Workshop on Statistical Signal 
Processing, IEEE, Nice, France. 

18. Lander,E.S. (2011) Initial impact of the sequencing of the human 
genome. Nature, 470, 187-197. 

19. VenterJ.C. (2010) Multiple personal genomes await. Nature, 464. 
676-677. 

20. Christley.S., Lu,Y., Li.C. and Xie.X. (2009) Human genomes as 
email attachments. Bioinformatics, 25, 274-275. 

21. Brandon.M.C, Wallace.D.C. and Baldi.P. (2009) Data structures 
and compression algorithms for genomic sequence data. 
Bioinformatics, 25, 1731-1738. 

22. Wang,C. and Zhang.D. (2011) A novel compression tool for 
efficient storage of genome resequencing data. Nucleic Acids Res., 
39, e45. 

23. Kuruppu.S., Puglisi,S.J. and Zobel.J. (2011) Optimized 
relative Lempel-Ziv compression of genomes. In Reynolds.M. 
(ed.), Optimized relative Lempel-Ziv compression of genomes. 
Proceeding, of ACSC 2011, 34th Australasian Computer Science 
Conference (ACSC 2011), Conferences in Research and Practice in 



Information Technology (CRPIT). Vol. 113, Australian Computer 
Society Inc, Perth Australia. 

24. Tembe.W., LoweyJ. and Suh.E. (2010) G-SQZ: compact 
encoding of genomic sequence and quality data. Bioinformatics, 
26, 2192-2194. 

25. Deorowicz.S. and Grabowski,S. (2011) Compression of DNA 
sequence reads in FASTQ format. Bioinformatics, 27, 860-862. 

26. Fritz,M.H.-Y., Leinonen.R., Cochrane.G. and Birney,E. (2011) 
Efficient storage of high throughput DNA sequencing data using 
reference-based compression. Genome Res., 21, 734-740. 

27. Kozanitis.C, Saunders,C, Kruglyak.S., Bafna,V. and Varghese,G. 
(2011) Compressing genomic sequence fragments using SlimGene. 
/. Comput. Biol., 18, 401^113. 

28. Ahn,S.-M., Kim,T.-H., Lee.S., Kim,D., Ghang.H., Kim,D.-S., 
Kim,B.-C, Kim,S.-Y., Kim,W.-Y., Kim,C. et al. (2009) The first 
Korean genome sequence and analysis: full genome sequencing 
for a socio-ethnic group. Genome Res., 19, 1622-1629. 

29. Huala,E., Dickerman,A.W., Garcia-Hernandez,M., Weems,D., 
Reiser,L., LaFond.F., Hanley.D., Kiphart.D., Zhuang,M., 
Huang,W. et al. (2001) The Arahidopsis Information Resource 
(TAIR): a comprehensive database and web-based information 
retrieval, analysis, and visualization system for a model plant. 
Nucleic Acids Res., 29, 102-105. 

30. Rhee.S.Y., Beavis.W., Berardini,T.Z., Chen,G., Dixon.D., 
Doyle,A., Garcia-Hernandez,M., Huala.E., Lander.G., 
Montoya,M. et al. (2003) The Arahidopsis Information 
Resource (TAIR): a model organism database providing a 
centralized, curated gateway to Arahidopsis biology, research 
materials and community. Nucleic Acids Res., 31, 224-228. 

31. Ouyang,S., Zhu.W., Hamilton.J., Lin.H., Campbell.M., Childs,K., 
Thibaud-Nissen,F., Malek,R.L., Lee.Y., Zheng.L. et al. (2007) 
The TIGR Rice Genome Annotation Resource: improvements 
and new features. Nucleic Acids Res., 35, D883-D887. 

32. Levy,S., Sutton,G., Ng,P.C, Feuk,L., Halpern.A.L., Walenz,B.P., 
Axelrod,N., Huang,J., Kirkness.E.F., Denisov,G. et al. (2007) 
The diploid genome sequence of an individual human. PLoS 
Biol, 5, 2113-2144. 

33. VenterJ.C, Adams.M.D., Myers.E.W., Li,P.W., Mural,R.J., 
Sutton,G.G., Smith,H.O., Yandell,M., Evans.C.A., Holt,R.A. 
et al. (2001) The sequence of the human genome. Science, 291, 
1304-1351. 

34. Wang,J., Wang.W., Li,R., Li,Y., Tian,G., Goodman.L., Fan,W., 
ZhangJ., Li,J., Zhang.J. et al. (2008) The diploid genome 
sequence of an Asian individual. Nature, 456, 60-66. 

35. The International Human Genome Sequencing Consortium (2001) 
Initial sequencing and analysis of the human genome. Nature, 
409, 860-921. 

36. RissanenJ. (1976) Generalized Kraft inequality and arithmetic 
coding. IBM J. Res. Develop, 20, 198-203. 

37. Waterston,R.H., Lander,E.S. and SulstonJ.E. (2002) On the 
sequencing of the human genome. Proc. Natl Acad. Sci. USA, 99, 
3712-3716. 



